Smart Paper Notes

When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Suvodeep Majumder , Joymallya Chakraborty , Tim Menzies

Category: Machine Learning

2022-11-10

Labeling a module as defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models, but there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even then, those tests have covered just a handful of projects. This paper takes a wide range of 55 semi-supervised learners and applies these to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. However, co-training needs to be used with caution, since the specific co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run-time cost: 11 hours vs. 1.8 hours). Those cautions stated, we find that, using these "co-trainers," we can label just 2.5% of data and then make predictions that are competitive with those using 100% of the data. It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. All the code used and datasets analyzed during the current study are available at https://GitHub.com/Suvodeep90/Semi_Supervised_Methods.

Can We Achieve Fairness Using Semi-Supervised Learning?

Joymallya Chakraborty , Huy Tu , Suvodeep Majumder , Tim Menzies

Category: Machine Learning

2021-11-03

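Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most prior software engineering work has concentrated on finding ethical bias in models rather than fixing it. After bias is found, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. In the real world, however, obtaining data with trustworthy ground truth is challenging, and the ground truth itself can contain human bias. Semi-supervised learning is a machine learning technique in which labeled data is used, incrementally, to generate pseudo-labels for the rest of the data (all of which is then used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data by class and protected attribute, as proposed by Chakraborty et al. at FSE 2021. Finally, the classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves performance similar to that of three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work in which semi-supervised techniques are used to fight ethical bias in SE ML models.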
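The pseudo-labeling loop that this abstract describes can be pictured with a small self-training sketch. This is not Fair-SSL itself: the nearest-centroid model, the confidence rule, and the names `self_train`, `threshold`, and `max_rounds` are all assumptions chosen to keep the example short and free of third-party dependencies.

```python
import math

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    """Iteratively pseudo-label `unlabeled` points with a nearest-centroid model.

    labeled:   list of (features, label) pairs, labels are 0 or 1 (both present)
    unlabeled: list of feature vectors
    Returns the enlarged list of (features, label) pairs.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Refit the (hypothetical) model on everything labeled so far.
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        still_unlabeled = []
        added = 0
        for x in pool:
            d0, d1 = math.dist(x, c0), math.dist(x, c1)
            total = d0 + d1
            # Confidence is 0.5 when equidistant and approaches 1.0 near a centroid.
            conf = 0.5 if total == 0 else max(d0, d1) / total
            if conf >= threshold:
                labeled.append((x, 0 if d0 < d1 else 1))
                added += 1
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
        if added == 0:  # nothing confident enough; stop early
            break
    return labeled
```

Points that never reach the confidence threshold stay unlabeled, which is one common way to limit the noise that pseudo-labels introduce.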

FairBalance: Improving Machine Learning Fairness on Multiple Sensitive Attributes With Data Balancing

Zhe Yu , Joymallya Chakraborty , Tim Menzies

Category: Machine Learning

2021-07-17

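This paper aims to improve machine learning fairness on multiple sensitive attributes. Machine learning fairness has attracted increasing attention as machine learning software is increasingly used for high-stakes and high-risk decisions. Most existing solutions for machine learning fairness target only one sensitive attribute (e.g., sex) at a time, have magic parameters to tune, or carry expensive computational overhead. To overcome these challenges, we propose balancing the training data distribution on every sensitive attribute before training the machine learning models. Our results show that, with low computational overhead, FairBalance can significantly reduce fairness metrics (AOD, EOD, and SPD) on every known sensitive attribute with little, if any, damage to prediction performance. In addition, FairBalanceClass, a non-trivial variant, also balances the class distribution in the training data. With FairBalanceClass, predictions no longer favor the majority class, thus achieving a higher F$_1$ score on the minority class. FairBalance and FairBalanceClass also outperform other state-of-the-art bias mitigation algorithms in terms of prediction performance and fairness metrics. This research will benefit society by providing a simple yet effective approach to improving the fairness of machine learning software on data with multiple sensitive attributes. Our results also validate the hypothesis that, on datasets with unbiased ground truth labels, ethical bias in the learned models is largely attributable to the training data having (1) differences in group size and (2) differences in class distribution within each group.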
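One way to picture data balancing across several sensitive attributes is as per-sample training weights. The abstract does not state the exact weighting that FairBalance or FairBalanceClass uses, so the scheme below (equal total weight for every group and class combination) and the name `balance_weights` are assumptions made for illustration only.

```python
from collections import Counter

def balance_weights(groups, labels):
    """Return one weight per sample so every (group, label) cell has equal total weight.

    groups: sequence of hashable group keys, e.g. tuples of sensitive-attribute
            values such as ("female", "over_40")
    labels: sequence of class labels, same length as `groups`
    """
    if len(groups) != len(labels):
        raise ValueError("groups and labels must have the same length")
    cell_counts = Counter(zip(groups, labels))
    n_cells = len(cell_counts)
    n = len(labels)
    # Each cell gets total weight n / n_cells, so the weights sum to n.
    return [n / (n_cells * cell_counts[(g, y)]) for g, y in zip(groups, labels)]
```

Because the group key can be a tuple, the same function covers one sensitive attribute or several at once. The resulting list could be passed to any learner that accepts per-sample weights.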

Probabilistic machine learning based predictive and interpretable digital twin for dynamical systems

Tapas Tripura , Aarya Sheetal Desai , Sondipon Adhikari , Souvik Chakraborty

Category: Machine Learning (Statistics) | Machine Learning

2022-12-19

A framework for creating and updating digital twins for dynamical systems from a library of physics-based functions is proposed. Sparse Bayesian machine learning is used to update the digital twin and derive an interpretable expression for it. Two approaches for updating the digital twin are proposed. The first approach makes use of both the input and output information from a dynamical system, whereas the second approach uses output-only observations to update the digital twin. Both methods use a library of candidate functions representing certain physics to infer new perturbation terms in the existing digital twin model. In both cases, the resulting expressions of the updated digital twins are identical, and in addition, the epistemic uncertainties are quantified. In the first approach, the regression problem is derived from a state-space model, whereas in the second, the output-only information is treated as a stochastic process. The concepts of Itô calculus and the Kramers-Moyal expansion are used to derive the regression equation. The performance of the proposed approaches is demonstrated using highly nonlinear dynamical systems such as the crack-degradation problem. The numerical results in this paper almost exactly identify the correct perturbation terms, along with their associated parameters, in the dynamical system. The probabilistic nature of the proposed approach also helps in quantifying the uncertainties associated with the updated models. The proposed approaches provide an exact and explainable description of the perturbations in digital twin models, which can be directly used for better cyber-physical integration, long-term future predictions, degradation monitoring, and model-agnostic control.

An ensemble neural network approach to forecast Dengue outbreak based on climatic condition

Madhurima Panja , Tanujit Chakraborty , Sk Shahid Nadim , Indrajit Ghosh , Uttam Kumar , Nan Liu

Category: Machine Learning

2022-12-16

Dengue fever is a virulent disease spreading across more than 100 tropical and subtropical countries in Africa, the Americas, and Asia. This arboviral disease affects around 400 million people globally, severely straining healthcare systems. The unavailability of a specific drug and a ready-to-use vaccine makes the situation worse. Hence, policymakers must rely on early warning systems to guide intervention-related decisions. Forecasts routinely provide critical information for dangerous epidemic events. However, the available forecasting models (e.g., weather-driven mechanistic, statistical time series, and machine learning models) lack a clear understanding of the different components needed to improve prediction accuracy and often provide unstable and unreliable forecasts. This study proposes an ensemble wavelet neural network with exogenous factor(s) (XEWNet) model that can produce reliable estimates for dengue outbreak prediction for three geographical regions, namely San Juan, Iquitos, and Ahmedabad. The proposed XEWNet model is flexible and can easily incorporate exogenous climate variable(s), confirmed by statistical causality tests, in its scalable framework. The proposed model is an integrated approach that incorporates wavelet transformation into an ensemble neural network framework, which helps in generating more reliable long-term forecasts. The proposed XEWNet allows complex non-linear relationships between dengue incidence cases and rainfall while remaining mathematically interpretable, fast in execution, and easily comprehensible. The proposal's competitiveness is measured using computational experiments based on various statistical metrics and several statistical comparison tests. In comparison with statistical, machine learning, and deep learning methods, our proposed XEWNet performs better in 75% of the cases for short-term and long-term forecasting of dengue incidence.

MAntRA: A framework for model agnostic reliability analysis

Yogesh Chandrakant Mathpati , Kalpesh Sanjay More , Tapas Tripura , Rajdip Nayek , Souvik Chakraborty

Category: Machine Learning | Machine Learning (Statistics)

2022-12-13

We propose a novel model-agnostic, data-driven framework for time-dependent reliability analysis. The proposed approach -- referred to as MAntRA -- combines interpretable machine learning, Bayesian statistics, and stochastic dynamic equation identification to evaluate the reliability of stochastically excited dynamical systems for which the governing physics is a priori unknown. A two-stage approach is adopted: in the first stage, an efficient variational Bayesian equation discovery algorithm is developed to determine the governing physics of an underlying stochastic differential equation (SDE) from measured output data. The developed algorithm is efficient and accounts for epistemic uncertainty due to limited and noisy data, and for aleatoric uncertainty due to environmental effects and external excitation. In the second stage, the discovered SDE is solved using a stochastic integration scheme and the probability of failure is computed. The efficacy of the proposed approach is illustrated on three numerical examples. The results obtained indicate the possible application of the proposed approach for reliability analysis of in-situ and heritage structures from on-site measurements.
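The second stage (integrate the discovered SDE, then estimate a failure probability) can be sketched as follows. The abstract says only "a stochastic integration scheme"; Euler-Maruyama is swapped in here as the simplest such scheme, and the first-passage failure criterion, the function name `failure_probability`, and all parameters are assumptions for illustration.

```python
import math
import random

def failure_probability(drift, diffusion, x0, barrier, t_end, dt, n_paths, seed=0):
    """Monte Carlo estimate of P(|X_t| >= barrier for some t <= t_end).

    Simulates the scalar SDE dX = drift(X) dt + diffusion(X) dW with the
    Euler-Maruyama scheme and counts paths that cross the barrier.
    """
    rng = random.Random(seed)
    n_steps = int(round(t_end / dt))
    sqrt_dt = math.sqrt(dt)
    failures = 0
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, 1.0) * sqrt_dt          # Brownian increment
            x = x + drift(x) * dt + diffusion(x) * dw   # Euler-Maruyama step
            if abs(x) >= barrier:                       # first-passage failure
                failures += 1
                break
    return failures / n_paths

# Example call: an Ornstein-Uhlenbeck process as a stand-in for a discovered SDE.
p_fail = failure_probability(lambda x: -x, lambda x: 0.5, 0.0, 1.0, 5.0, 0.01, 200)
```

The estimate's accuracy depends on both `dt` (discretization error) and `n_paths` (sampling error), so rare failures would need many more paths or a variance-reduction method.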

A Neural ODE Interpretation of Transformer Layers

Yaofeng Desmond Zhong , Tongtao Zhang , Amit Chakraborty , Biswadip Dey

Category: Machine Learning | Artificial Intelligence

2022-12-12

Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. As the transformer layers use residual connections to avoid the problem of vanishing gradients, they can be viewed as the numerical integration of a differential equation. In this extended abstract, we build upon this connection and propose a modification of the internal architecture of a transformer layer. The proposed model places the multi-head attention sublayer and the MLP sublayer parallel to each other. Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks. Moreover, for the image classification task, we show that using neural ODE solvers with a sophisticated integration scheme further improves performance.
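The difference between the usual sequential layout and the parallel layout can be shown with two residual update rules. The sketch below uses toy stand-in functions (`toy_attn` and `toy_mlp` are hypothetical and are not real attention or MLP sublayers); it only illustrates that the parallel form matches one forward-Euler step of dx/dt = attn(x) + mlp(x).

```python
def sequential_layer(x, attn, mlp):
    """Standard layout: two residual updates applied one after the other."""
    x = [xi + ai for xi, ai in zip(x, attn(x))]
    return [xi + mi for xi, mi in zip(x, mlp(x))]

def parallel_layer(x, attn, mlp, step=1.0):
    """Parallel layout: both sublayers read the same input.

    This is one forward-Euler step of dx/dt = attn(x) + mlp(x) with step size `step`.
    """
    a, m = attn(x), mlp(x)
    return [xi + step * (ai + mi) for xi, ai, mi in zip(x, a, m)]

# Toy stand-ins (NOT real attention or MLP sublayers).
def toy_attn(x):
    mean = sum(x) / len(x)
    return [mean - xi for xi in x]            # pulls each entry toward the mean

def toy_mlp(x):
    return [0.5 * max(xi, 0.0) for xi in x]   # scaled ReLU

seq_out = sequential_layer([1.0, 3.0], toy_attn, toy_mlp)
par_out = parallel_layer([1.0, 3.0], toy_attn, toy_mlp)
```

In the sequential form the MLP sees the attention output, whereas in the parallel form both sublayers see the same state, which is what makes the single-step ODE reading possible.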

Multimodal Query-guided Object Localization

Aditay Tripathi , Rajath R Dani , Anand Mishra , Anirban Chakraborty

Category: Computer Vision

2022-12-01

Consider a scenario in one-shot query-guided object localization where neither an image of the object nor the object category name is available as a query. In such a scenario, a hand-drawn sketch of the object could be a choice for a query. However, hand-drawn crude sketches alone, when used as queries, might be ambiguous for object localization, e.g., a sketch of a laptop could be confused for a sofa. On the other hand, a linguistic definition of the category, e.g., "a small portable computer small enough to use in your lap", along with the sketch query, gives better visual and semantic cues for object localization. In this work, we present a multimodal query-guided object localization approach under the challenging open-set setting. In particular, we use queries from two modalities, namely, a hand-drawn sketch and a description of the object (also known as a gloss), to perform object localization. Multimodal query-guided object localization is a challenging task, especially when a large domain gap exists between the queries and the natural images, and because of the challenge of combining the complementary and minimal information present across the queries. For example, hand-drawn crude sketches contain abstract shape information of an object, while the text descriptions often capture partial semantic information about a given object category. To address the aforementioned challenges, we present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals relevant to the input queries and a novel orthogonal projection-based proposal scoring technique that scores each proposal with respect to the queries, thereby yielding the final localization results. ...

Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits

Sunrit Chakraborty , Saptarshi Roy , Ambuj Tewari

Category: Machine Learning (Statistics) | Machine Learning

2022-11-11

We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling (TS) algorithm, using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter, and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees for Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use a spike-and-slab prior to model the unknown parameter and variational inference instead of MCMC to approximate the posterior distribution. Extensive simulations demonstrate improved performance of our proposed algorithm over existing ones.

CoNMix for Source-free Single and Multi-target Domain Adaptation

Vikash Kumar , Rohit Lal , Himanshu Patil , Anirban Chakraborty

Category: Machine Learning | Artificial Intelligence | Computer Vision

2022-11-07

This work introduces the novel task of Source-free Multi-target Domain Adaptation and proposes an adaptation framework comprising Consistency with Nuclear-Norm Maximization and MixUp knowledge distillation (CoNMix) as a solution to this problem. The main motive of this work is to solve Single and Multi-target Domain Adaptation (SMTDA) in the source-free paradigm, which enforces the constraint that the labeled source data is not available during target adaptation because of various privacy-related restrictions on data sharing. The source-free approach leverages target pseudo labels, which can be noisy, to improve the target adaptation. We introduce consistency between label-preserving augmentations and use pseudo-label refinement methods to reduce noisy pseudo labels. Further, we propose a novel MixUp Knowledge Distillation (MKD) for better generalization on multiple target domains using various source-free STDA models. We also show that the Vision Transformer (VT) backbone gives better feature representation with improved domain transferability and class discriminability. Our proposed framework achieves state-of-the-art (SOTA) results in various paradigms of source-free STDA and MTDA settings on popular domain adaptation datasets like Office-Home, Office-Caltech, and DomainNet. Project Page: https://sites.google.com/view/conmix-vcl
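The MixUp ingredient of MKD can be illustrated with the standard MixUp interpolation applied to inputs and teacher soft labels. The abstract does not spell out how CoNMix combines MixUp with distillation, so the sketch below shows only generic MixUp; the function name `mixup`, the `alpha` default, and the use of teacher probabilities as targets are assumptions.

```python
import random

def mixup(x1, x2, p1, p2, alpha=0.3, rng=None):
    """Blend two inputs and their soft labels with a Beta-distributed coefficient.

    x1, x2: feature vectors of equal length
    p1, p2: class-probability vectors (e.g. teacher predictions) of equal length
    Returns (mixed_input, mixed_target, lam).
    """
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]

    def blend(a, b):
        return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]

    return blend(x1, x2), blend(p1, p2), lam

# Example call with two toy samples and one-hot teacher outputs.
mixed_x, mixed_p, lam = mixup([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

A student trained on `mixed_x` against `mixed_p` sees interpolated examples between domains or classes, which is the usual motivation for MixUp as a regularizer.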
